
technology for the language teacher

Automated writing evaluation

Nicky Hockly

ELT Journal; doi:10.1093/elt/ccy044
© The Author(s) 2018. Published by Oxford University Press; all rights reserved.



Introduction

In this series, we explore technology-related themes and topics. The series aims to discuss and demystify what may be new areas for some readers and to consider their relevance for English language teachers.


Developments in technology, particularly in the field of natural language processing (NLP) and latent semantic analysis (LSA), have led to the increased uptake of software that can automatically analyse student writing. Known as automated writing evaluation (AWE), this software works by comparing a written text to a large database of writing of the same genre, written in answer to a specific prompt or rubric. The software then analyses measurable features in a text, such as syntax, text complexity, total word count, and vocabulary range, through statistical modelling and algorithms, and the text is given an overall score; this score can be supported by suggestions for improvements, and/or standardized feedback on the writing style. In the field of ELT, popular AWE platforms currently include Criterion, developed by ETS, and Write&Improve from Cambridge English; the latter is freely available online. Other well-known AWE programs are MY Access! from Vantage Learning, and WriteToLearn from Pearson.
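To make the idea of 'measurable features' more concrete, the short Python sketch below extracts three of the surface features mentioned above (total word count, average sentence length, and vocabulary range) and combines them into a single score. It is purely illustrative: the feature set, weights, and scoring scale are assumptions made for the example, not the method used by Criterion, Write&Improve, or any other AWE program, all of which rely on much richer NLP and LSA models trained on large corpora of prompt-specific writing.

```python
# A minimal sketch (not the algorithm of any named AWE program): extract a few
# surface features from a text and combine them into a single band score.
import re


def surface_features(text: str) -> dict:
    """Return simple, measurable features of a student text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())
    return {
        "word_count": len(words),
        "avg_sentence_length": len(words) / max(len(sentences), 1),
        # type-token ratio as a crude proxy for vocabulary range
        "vocab_range": len(set(words)) / max(len(words), 1),
    }


def toy_score(text: str) -> float:
    """Map features to a 0-6 score using arbitrary, hand-picked weights."""
    f = surface_features(text)
    raw = (
        0.004 * min(f["word_count"], 600)            # reward length, capped
        + 0.05 * min(f["avg_sentence_length"], 30)   # reward longer sentences, capped
        + 2.0 * f["vocab_range"]                     # reward lexical variety
    )
    return round(min(raw, 6.0), 1)


if __name__ == "__main__":
    essay = ("Automated writing evaluation compares a text with a corpus of "
             "similar writing and returns a score with suggested improvements. "
             "Critics question whether machines can judge creativity.")
    print(surface_features(essay))
    print("toy score:", toy_score(essay))
```

A model built only on surface features of this kind rewards length and lexical variety regardless of meaning, which is precisely the weakness that Perelman's Babel Generator, discussed below, is designed to expose.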

AWE can be used for summative assessment, in placement tests, for formative assessment, and/or in writing instruction. AWE software such as Criterion or Write&Improve provides immediate feedback on a text in an interactive format: a student can scroll through his/her text, stop at sections that the software has highlighted, and read comments on his/her specific language choices. Some AWE programs allow teachers to add their own feedback comments on the text, or to tailor the general feedback provided by the software on a piece of writing. Some AWE tools also include a wrap-around learning management system that enables students to upload work and create writing portfolios, and allows teachers to track a student's writing progress and grades. Some AWE tools give students access to writing samples and online dictionaries, and give teachers access to additional aids such as plagiarism detection.

The effectiveness of AWE software is a topic of some debate among both researchers and classroom practitioners, with advocates viewing AWE as an excellent tool for improving students' writing, and detractors unconvinced that machines can reliably score creative written output.

There is also some debate over terminology. AWE is referred to in the literature as Automated Essay Scoring (AES), Automated Essay Evaluation (AEE), and Automated Essay Grading (AEG). However, by using the word 'essay', these three terms exclude the wide range of writing genres that students typically produce, and the use of the words 'scoring' and 'grading' suggests summative assessment, whereas the software can also be used for formative assessment (Elliot et al. 2013). Indeed, terms such as AES are most often used in the context of high-stakes testing, whereas the more general term AWE is most frequently associated with 'lower-stakes writing instruction' (Grimes and Warschauer 2010: 5). In this article I focus primarily on AWE as a tool for providing formative feedback to improve the quality of students' writing. This is the area that appears to hold the most potential for AWE in the context of ELT and English language learning, and a substantial body of research into AWE in ELT focuses on its formative potential.


AWE for summative assessment

Although the use of AWE dates back to the 1960s, it has become increasingly common in large-scale assessments over the last decade. The introduction of the Common Core Standards in the USA, with their emphasis on standardized testing, has created a growing market for computer-based testing solutions. For example, Pearson's Intelligent Essay Assessor graded approximately 34 million student essays for state and national tests in the USA in 2017 (Smith 2018).

The use of AWE for summative assessment is controversial. The immediate benefits of AWE are obvious: the automatic grading of a student's writing by a software program is far faster and cheaper than using human evaluators; hence the attraction of automatic scoring in high-stakes testing, where the writing of large numbers of students needs to be assessed in a short time. In summative assessment, a major focus of research has been the extent to which AWE programs produce scores for student writing that are comparable to the scores given by human evaluators, as a key measure of reliability. On balance, the studies that compare human and machine scoring of essays show some degree of consistency between the grades awarded by human scorers and those awarded by AWE programs, despite some notable exceptions (see Elliot et al. 2013).

However, critics point to the importance of writing as a socially embedded process, and question the ability of AWE software to judge critical thinking, rhetorical knowledge, creativity, or a student's ability to tailor their text to a specific readership. In a 'stumping' case study of Criterion (i.e. a study designed to illustrate the limitations of the system), Herrington and Moran (2012) demonstrated how the system was unable to accurately score their own essay written in response to a prompt. Many teachers who have informally tried AWE programs with their own writing can attest to similar experiences.

Critics also claim that students can learn to game the system. Perelman (quoted in Smith 2018), a particularly vocal critic of AWE for testing purposes, has gone further and designed an online tool that generates nonsensical texts based on essay prompts that can earn top scores from AWE programs. The following extract is taken from a 500-word example text produced by his Babel Generator tool, created by including three words from an AWE program essay prompt:

History by mimic has not, and presumably never will be precipitously but blithely ensconced. Society will always encompass imaginativeness; many of scrutinizations but a few for an amanuensis. The perjured imaginativeness lies in the area of theory of knowledge but also the field of literature. Instead of enthralling the analysis, grounds constitutes both a disparaging quip and a diligent explanation. (Smith 2018: no page number)

This example earned the top score in one AWE program used in high-stakes testing. It shows a range of complex discourse features and syntax, and a wide range of vocabulary, but as Perelman points out, 'it makes absolutely no sense' (ibid.).


AWE for formative purposes

The use of AWE for formative purposes—i.e. to support the development of students' writing—seems to hold more promise, especially in the field of ELT. Research has shown that the use of AWE software for formative purposes can encourage students to review their work (Chapelle 2008), as well as increase students' motivation to write (Warschauer and Grimes 2008). A mixed-methods study with English language learners found that the use of Criterion not only led to increased revisions of written work, but also that the accuracy of that work improved over drafts due to the corrective feedback provided by the AWE software (Li, Link, and Hegelheimer 2015). Although teachers' attitudes towards the AWE program in this study were positive overall, students' responses were mixed; the researchers suggest that student attitudes may depend on their language proficiency, as well as on their teachers' attitudes to AWE and the way the software is used with students.

Despite these positive findings, the use of AWE for formative purposes is a complex and nuanced area for researchers and practitioners alike. It has been pointed out, for example, that AWE programs can privilege certain languages and cultural backgrounds. In a case study with first-language writers using Criterion, Herrington and Stanley (2012) found that, by demanding linguistic homogeneity in its focus on standard American English, the AWE program was unable to accept effective rhetorical and stylistic uses of language from alternative traditions derived from class or race. In a study reviewing 17 AWE programs, Vojak, Kline, Cope, McCarthey, and Kalantzis (2011) concluded that, despite appealing features such as immediate feedback and plagiarism detection, AWE systems promote a restricted view of writing that focuses on 'correctness', ignores the cultural and social aspects of writing, and provides no outlet for the multimodal expressions of writing that are available in the digital age. Critics suggest that while AWE may be used to measure the surface features of a text (e.g. number of words, average sentence length, or range of vocabulary), richer forms of assessment, such as portfolio assessment, should be used in tandem with AWE to ensure construct validity (e.g. Deane 2013).

Many AWE research studies to date have been carried out in non-ELT contexts, with students writing in their first language, as well as with English language learners; however, the writing development needs of these two groups differ. One might argue that in ELT contexts, where students will typically have lower levels of English language proficiency than in first-language contexts, attention to surface textual features such as grammar, vocabulary, and discourse elements is exactly what English language learners can benefit from, especially when AWE feedback is used for formative purposes. Some studies into the effects of AWE on the quality of student writing in EFL academic programmes do indeed show positive effects (e.g. Li et al. 2014), and lower-proficiency students have been found to benefit from automated feedback on areas such as grammar and word choice (Wang, Shang, and Briody 2013; Liao 2016).

However, initially positive findings may prove problematic on closer scrutiny. A study by El Ebyary and Windeatt (2010) found that the use of the Criterion AWE program in a large academic writing class of over 800 Egyptian EFL students increased students' reflection on their own writing; in addition, the feedback provided by Criterion had a positive effect on the quality of the writing in students' revisions and subsequent drafts (a finding replicated by human scoring in the same study). Nevertheless, a close examination of the students' writing produced in this study showed that higher scores were achieved through avoidance strategies—in other words, students avoided using language that would cause errors in usage, mechanics, or style as measured by Criterion.

Avoidance, and/or the deliberate use of strategies known to affect AWE feedback and scores, is a theme that recurs in the literature. Warschauer and Grimes (2008) found that students tended to make minor changes to their texts (e.g. by correcting spelling, word choice, or grammar) to improve their AWE essay scores rather than addressing more complex issues such as text organization or content. In addition, the researchers uncovered a number of paradoxes in this study: firstly, although students and teachers reported positive attitudes towards AWE, they criticized the accuracy of the software's scoring and feedback; secondly, the students did not write more frequently despite their positive attitudes; and thirdly, although students and teachers expressed the desire to review and revise their written work during class based on the AWE feedback, no time was scheduled for students to do so due to the pressure of covering a full curriculum and preparing students for state tests. The researchers highlight in particular the importance of social and contextual factors in mediating the use of AWE programs, including the pressure of standardized testing, diverse student populations, and teachers' and students' beliefs and past teaching/learning experiences.

By examining the classroom context in more depth, other studies have found a range of factors that can affect the use of AWE with students. A multi-site, longitudinal analysis of AWE programs in middle schools in the USA (Grimes and Warschauer 2010) found that students were motivated to write by the programs, and that this made classroom management easier for teachers. However, the study revealed the need for students and teachers to critically evaluate the feedback and scores generated by the AWE programs, as well as the need for teachers to use these programs in pedagogically sound ways that included a range of writing genres and activities for their students. The researchers concluded that the success of an AWE application in one of their case studies

[was] the result of many local factors that are not easy to replicate, including a mature AWE technology, strong administrative support, excellent professional training, teachers ready to experiment with technology, and … strong peer support among teachers. (ibid.: 34)

In a further study, interviews with elementary school teachers in the USA revealed varied attitudes to the effectiveness of AWE software. Teachers' enthusiasm for AWE programs decreased over time, and those teachers with a high percentage of English language learners in their classes preferred to take advantage of Web 2.0 tools that enabled their students to write for authentic audiences (Warschauer 2011).


Validation frameworks

One concern for researchers is how to validate AWE software and evaluate its fitness for purpose. Kane (2013) proposes an argument-based validation framework that focuses on how AWE is used in practice; the framework suggests the critical evaluation of a network of arguments and inferences in order to assess how effective (or not) a specific AWE program may be in a specific context. The basic idea behind Kane's validation argument is to test to what extent the claims (or 'warrants') and assumptions made by AWE software are borne out in research and empirical evidence.

Chapelle, Cotos, and Lee (2015) propose a validation framework for the use of AWE with L2 writers that addresses seven domains: evaluation (the accuracy of feedback provided by an AWE program), generalization (the extent to which the feedback can be generalized), explanation (the quality of feedback explanations), extrapolation (the extent to which feedback can be extrapolated to other contexts), utilization (how feedback is used by students), ramification (the extent to which learning takes place from AWE use), and domain definition (how a real-world domain of interest is defined by the AWE program).

Some recent research into the use of AWE for formative purposes in the context of English language learning has focused on areas within these validation frameworks. For example, Ranalli, Link, and Chukharev-Hudilainen (2017) examined the 'evaluation' warrant in a study with ESL students using Criterion, and found that, despite issues with the accuracy of the feedback, lower-level students were able to take better advantage of the AWE feedback than higher-level students, probably due to the focus on form in their writing course. They conclude that the use of Criterion 'may be more justified on courses where there is a congruent focus on form, supporting previous research showing that the manner in which AWE is integrated into instruction influences its acceptance by students' (ibid.: 28). They also found that specific and detailed feedback on errors was more useful for students than generic feedback. In a second study, these researchers examined the 'utilization' warrant to determine to what extent students found Criterion's diagnostic feedback useful for revising their texts. The students in this study were able to correct errors based on the program's feedback 55–65% of the time; the researchers point out that whether this 'middling finding' (ibid.: 26) is acceptable or not will depend on the aims of the writing task.

Conclusion

The increased use of AWE in both summative and formative writing assessment suggests that this technology is here to stay. However, the current limits of natural language processing and latent semantic analysis technology mean that AWE is not a magic bullet, and research results on the overall effectiveness of AWE and its ability to develop students' writing are mixed. Stevenson and Phakiti (2014: 51) cite '[p]aucity of research, the mixed nature of research findings, heterogeneity of participants, contexts and designs, and methodological issues in some of the existing research … as factors that limit our ability to draw firm conclusions concerning the effectiveness of AWE feedback'.

AWE may nevertheless hold some promise as a tool to support teachers in English as a second or foreign language writing instruction programmes, especially for lower-proficiency learners for whom a focus on form may be a priority. However, a wide range of factors will determine the extent to which AWE is effective for English language learners in any specific context; these factors include age, language proficiency, instructional approaches, and beliefs and attitudes to AWE. Some researchers advocate that AWE programs be used in tandem with other evaluation systems, such as writing portfolios, so that instruction and peer review can be integrated into a more holistic approach to writing development and assessment.

For researchers and practitioners interested in exploring the effects of AWE on the development of English language learners’ writing skills, one fruitful area of investigation might involve further research based on the validation frameworks of Kane (2013) and/or Chapelle et al. (2015), as well as replication of some of the more methodologically robust studies cited in this article. In massive open online courses (MOOCs), where large groups of students submit texts for feedback or assessment, AWE programs used in tandem with calibrated peer review also provide a rich potential site for research (Balfour 2013). If one thing is clear, it is that further research into AWE in the field of ELT is needed.

Final version received September 2018


References

Balfour, S. 2013. 'Assessing writing in MOOCs: automated essay scoring and calibrated peer review'. Research and Practice in Assessment 8: 40–8.

Chapelle, C. A. 2008. 'Utilizing technology in language assessment' in E. Shohamy (ed.). Encyclopedia of Language Education (Second edition). Heidelberg: Springer.

Chapelle, C. A., E. Cotos, and J. Y. Lee. 2015. ‘Validity arguments for diagnostic assessment using automated writing evaluation’. Language Testing 32: 385–405.

Deane, P. 2013. ‘On the relationship between automated essay scoring and modern views of the writing construct’. Assessing Writing 18/1: 7–24.

El Ebyary, K. and S. Windeatt. 2010. ‘The impact of computer-based feedback on students’ written work’. International Journal of English Studies 10/2: 122–44.

Elliot, N., A. Ruggles Gere, G. Gibson, C. Toth, C. Whithaus, and A. Presswood. 2013. 'Use and limitations of automated writing evaluation software'. WPA-CompPile Research Bibliographies 23. Available at http://comppile.org/wpa/bibliographies/Bib23/AutoWritingEvaluation.pdf (accessed on 10 September 2018).

Grimes, D. and M. Warschauer. 2010. ‘Utility in a fallible tool: a multi-site case study of automated writing evaluation’. Journal of Technology, Learning, and Assessment 8/6: 1–43.

Herrington, A. and S. Stanley. 2012. ‘Criterion: promoting the standard’ in A. B. Inoue and M. Poe (eds.). Race and Writing Assessment. New York: Peter Lang, pp. 47–61.

Herrington, A. and C. Moran. 2012. 'Writing to a machine is not writing at all' in N. Elliot and L. Perelman (eds.). Writing Assessment in the Twenty-First Century: Essays in Honor of Edward M. White. New York: Hampton Press, pp. 219–32.

Kane, M. T. 2013. 'Validating the interpretations and uses of test scores'. Journal of Educational Measurement 50/1: 1–73.

Li, Z., S. Link, H. Ma, H. Yang, and V. Hegelheimer. 2014. ‘The role of automated writing evaluation holistic scores in the ESL classroom’. System 44: 66–78.

Li, J., S. Link, and V. Hegelheimer. 2015. ‘Rethinking the role of automated writing evaluation (AWE) feedback in ESL writing instruction’. Journal of Second Language Writing 27: 1–18.

Liao, H.-C. 2016. ‘Using automated writing evaluation to reduce grammar errors in writing’. ELT Journal 70/3: 308–19.

Ranalli, J., S. Link, and E. Chukharev-Hudilainen. 2017. ‘Automated writing evaluation for formative assessment of second language writing: investigating the accuracy and usefulness of feedback as part of argument-based validation’. Educational Psychology 37/1: 8–25.

Shohamy, E. (ed.). 2008. Encyclopedia of Language Education (Second edition). Heidelberg: Springer.

Smith, T. 2018. 'More states opting to "robo-grade" student essays by computer'. National Public Radio website. Available at https://www.npr.org/2018/06/30/624373367/more-states-opting-to-robo-grade-student-essays-by-computer (accessed on 10 September 2018).

Stevenson, M. and A. Phakiti. 2014. ‘The effects of computer-generated feedback on the quality of writing’. Assessing Writing 19: 51–65.

Vojak, C., S. Kline, B. Cope, S. McCarthey, and M. Kalantzis. 2011. 'New spaces and old places: an analysis of writing assessment software'. Computers and Composition 28/2: 97–111.

Wang, Y.-J., H.-F. Shang, and P. Briody. 2013. ‘Exploring the impact of using automated writing evaluation in English as a foreign language university students’ writing’. Computer Assisted Language Learning 26/3: 234–57.

Warschauer, M. 2011. Learning in the Cloud: How (and Why) to Transform Schools with Digital Media. New York: Teachers College Press.

Warschauer, M. and D. Grimes. 2008. 'Automated writing assessment in the classroom'. Pedagogies: An International Journal 3/1: 22–36.


The author

Nicky Hockly is the Director of Pedagogy of The Consultants-E (www.theconsultants-e.com), an award-winning online training and development organization. She has been involved in EFL teaching and teacher training since 1987 and is co-author of How to Teach English with Technology, Learning English as a Foreign Language for Dummies, Teaching Online, Digital Literacies, and Going Mobile, and sole author of Focus on Learning Technologies (2016) and ETpedia Technology (2017). She trains, writes, and consults on the principled application of new technologies to language teaching.

Email: nicky.hockly@theconsultants-e.com

